Allow batching the output of a join #2310

Merged

revans2 merged 10 commits into NVIDIA:branch-0.6 from revans2:initial_out_of_core_join

May 3, 2021

+1,193 −397

Collaborator

revans2 commented Apr 29, 2021

This is the first step toward out-of-core joins. It at least partially addresses #20.

This depends on rapidsai/cudf#8118 to go in first.

For most cases that I have tested, this is strictly better than what we had before. If the output of the join fits in the output batch size, then the join will happen just like it does today. If the output is larger than that, we can now output it in multiple batches. The problem that I have found is that the gather map is not spillable, and after a single batch is output the GPU Semaphore is released. This means that for contrived joins that explode evenly, each active task will have a potentially large gather map in memory. I think I can make it spillable without a lot of work. If I can, then I might just do it. But I also want to spend some time running benchmarks to see if this can help fix some of the exploding join issues we have seen there.
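As a rough illustration of the idea described above, here is a minimal plain-Python sketch of a join that first builds gather maps and then materializes the output in bounded batches. It is not the plugin's Scala code or the cudf API: the function names `build_gather_maps` and `gather_in_batches` are hypothetical, rows are modeled as tuples, and the batch limit is counted in rows for simplicity, whereas the plugin's batch size is a configured target that this sketch does not model.

```python
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def build_gather_maps(
    left_keys: Sequence[Hashable], right_keys: Sequence[Hashable]
) -> Tuple[List[int], List[int]]:
    """Return parallel index lists (left_idx, right_idx) for an inner equi-join."""
    # Index the right side by key so each left row can find its matches.
    positions: Dict[Hashable, List[int]] = {}
    for j, key in enumerate(right_keys):
        positions.setdefault(key, []).append(j)

    left_idx: List[int] = []
    right_idx: List[int] = []
    for i, key in enumerate(left_keys):
        for j in positions.get(key, []):
            left_idx.append(i)
            right_idx.append(j)
    return left_idx, right_idx


def gather_in_batches(
    left_rows: Sequence[tuple],
    right_rows: Sequence[tuple],
    left_idx: Sequence[int],
    right_idx: Sequence[int],
    max_rows: int,
) -> Iterator[List[tuple]]:
    """Yield joined rows in batches of at most max_rows (hypothetical row-based limit)."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # Only one slice of the gather maps is materialized into output at a time.
    for start in range(0, len(left_idx), max_rows):
        end = start + max_rows
        yield [
            left_rows[i] + right_rows[j]
            for i, j in zip(left_idx[start:end], right_idx[start:end])
        ]


left = [(1, "a"), (2, "b"), (1, "c")]
right = [(1, "x"), (1, "y"), (3, "z")]
l_map, r_map = build_gather_maps([r[0] for r in left], [r[0] for r in right])
batches = list(gather_in_batches(left, right, l_map, r_map, max_rows=3))
```

Note that in this sketch the gather maps stay fully in memory between batches, which mirrors the concern raised above about large, non-spillable gather maps.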


          Allow batching the output of a join

ee51ac9

Signed-off-by: Robert (Bobby) Evans <[email protected]>

revans2 added feature request cudf_dependency labels

revans2 added this to the Apr 26 - May 7 milestone

revans2 self-assigned this

revans2 commented

Collaborator Author

revans2 left a comment

I'll try to add some more Javadocs too.

integration_tests/src/main/python/asserts.py Outdated Show resolved Hide resolved

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuHashJoin.scala Outdated Show resolved Hide resolved

Collaborator Author

revans2 commented Apr 30, 2021

I have been adding in spilling for the gather maps, which let me push things a bit further, and I found a bug in the gather implementation.

rapidsai/cudf#8121
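To make "spillable gather map" concrete, here is a small plain-Python sketch of the general pattern: a holder that can push its indices out of memory and reload them on demand. This is only an assumption-laden model, not the plugin's spill framework. The class name `SpillableGatherMap` is hypothetical, and a temporary file with `pickle` stands in for whatever storage tier a real implementation would spill to.

```python
import os
import pickle
import tempfile
from typing import List, Optional


class SpillableGatherMap:
    """Holds a list of row indices that can be spilled to disk and restored."""

    def __init__(self, indices: List[int]) -> None:
        self._indices: Optional[List[int]] = list(indices)
        self._path: Optional[str] = None
        self._length = len(indices)

    def __len__(self) -> int:
        # The length stays known even while the data is spilled.
        return self._length

    @property
    def is_spilled(self) -> bool:
        return self._indices is None and self._path is not None

    def spill(self) -> None:
        """Write the indices to a temp file and drop the in-memory copy."""
        if self._indices is None:
            return
        fd, path = tempfile.mkstemp(suffix=".gathermap")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self._indices, f)
        self._path = path
        self._indices = None

    def get(self) -> List[int]:
        """Return the indices, reloading them from disk if they were spilled."""
        if self._indices is None:
            if self._path is None:
                raise ValueError("gather map has been closed")
            with open(self._path, "rb") as f:
                self._indices = pickle.load(f)
            os.remove(self._path)
            self._path = None
        return self._indices

    def close(self) -> None:
        """Release both the in-memory copy and any spill file."""
        if self._path is not None:
            os.remove(self._path)
            self._path = None
        self._indices = None


gather_map = SpillableGatherMap([0, 0, 2, 2])
gather_map.spill()                 # e.g. when the task gives up the GPU
spilled = gather_map.is_spilled
restored = gather_map.get()        # reloaded when the next batch is needed
```

The design point is that a task waiting between output batches no longer has to pin its gather map in memory; it only pays a reload cost when it resumes.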


          Allow spilling of gather maps

7483cc2

abellina requested changes

Collaborator

abellina left a comment

First pass @revans2

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuHashJoin.scala Outdated Show resolved Hide resolved

jlowe reviewed

...park300/src/main/scala/com/nvidia/spark/rapids/shims/spark300/GpuBroadcastHashJoinExec.scala Outdated Show resolved Hide resolved

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuHashJoin.scala Outdated Show resolved Hide resolved

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuHashJoin.scala Show resolved Hide resolved

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuHashJoin.scala Outdated Show resolved Hide resolved

revans2 added 3 commits

April 30, 2021 15:16


          Some fixes and cleanup

a39ff12


          Some better profiling

0ecd1b7


          Addressed some review comments

b6a0c79

abellina requested changes

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuBoundAttribute.scala Outdated Show resolved Hide resolved

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuHashJoin.scala Outdated Show resolved Hide resolved

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuHashJoin.scala Show resolved Hide resolved

revans2 added 2 commits

May 1, 2021 09:31


          Refactored JoinGatherer to new file and cleanup of ownership

43fe631


          More code cleanup

3872ed4

Collaborator Author

revans2 commented May 1, 2021

I think I have addressed all of the review comments. I put together my own quick hack for rapidsai/cudf#8121, and it is not enough to be able to run q72. Even at a batch size of 26m and 200 partitions, it took over 6 minutes to finish one of the join tasks out of 200, and it failed on the next one. We are going to have to really think about what we want to try and do to support query 72. But all of the others run with reasonable configurations.
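For readers unfamiliar with why a join "explodes", the following plain-Python sketch shows the arithmetic: the output of an inner equi-join has one row per matching pair, so a key that appears many times on both sides multiplies out. The function names and the tiny key lists are made up for illustration and are not measurements from q72 or any benchmark.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def inner_join_output_rows(
    left_keys: Sequence[Hashable], right_keys: Sequence[Hashable]
) -> int:
    """Count inner-join output rows: sum over keys of left_count * right_count."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    return sum(
        count * right_counts[key]
        for key, count in left_counts.items()
        if key in right_counts
    )


def batches_needed(output_rows: int, max_rows_per_batch: int) -> int:
    """Number of output batches under a hypothetical row-based batch limit."""
    if max_rows_per_batch <= 0:
        raise ValueError("max_rows_per_batch must be positive")
    return math.ceil(output_rows / max_rows_per_batch)


# Key 1 appears 3 times on the left and 2 times on the right: 6 rows.
# Key 2 appears once on each side: 1 row. Key 3 has no match.
rows = inner_join_output_rows([1, 1, 1, 2], [1, 1, 2, 3])
n_batches = batches_needed(rows, max_rows_per_batch=3)
```

With heavily repeated keys the product term dominates, so the output can be far larger than either input even when each input fits comfortably in a single batch.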

jlowe reviewed

...park300/src/main/scala/com/nvidia/spark/rapids/shims/spark300/GpuBroadcastHashJoinExec.scala Show resolved Hide resolved

revans2 marked this pull request as ready for review

May 1, 2021 22:45


          Addressed review comments

50f2580

jlowe reviewed

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuShuffledHashJoinBase.scala Show resolved Hide resolved

abellina reviewed

sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuHashJoin.scala Show resolved Hide resolved

revans2 added 2 commits

May 3, 2021 10:09


          Merge branch 'branch-0.6' into initial_out_of_core_join

0a4a024

Tests are not passing with struct joins; need to do some more debugging


          Cleanup and fixes

464a313

Collaborator Author

revans2 commented May 3, 2021

build

Collaborator Author

revans2 commented May 3, 2021

I upmerged and had to update the code for the new struct join support. A good thing too because it exposed a bug in my filtering code. It would only have been a performance regression before the struct code, but afterwards it became an error. This should be all ready to go now. The dependency is merged.

jlowe approved these changes

abellina approved these changes

revans2 merged commit 8acac67 into NVIDIA:branch-0.6

revans2 deleted the initial_out_of_core_join branch

May 3, 2021 18:25

This was referenced May 3, 2021

[BUG] TPC-ds 14a and 14b failed to run #650

Closed

[BUG] TPC-DS-like query 24a and 24b at scale=3TB fails with OOM #1628

Closed

[BUG] TPC-DS-like query 95 at scale=3TB fails with OOM #1630

Closed

[BUG] TPC-DS-like query 72 at scale=3TB fails with "Maximum join output size exceeded" #1629

Closed

sameerz mentioned this pull request

[FEA] Support out of core joins #20

Closed

nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request


          Allow batching the output of a join (NVIDIA#2310)

6465e51

Signed-off-by: Robert (Bobby) Evans <[email protected]>

nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request


          Allow batching the output of a join (NVIDIA#2310)

1d03487

Signed-off-by: Robert (Bobby) Evans <[email protected]>

Labels

cudf_dependency feature request